('2025-08-21 13:34:30', 1755772470.672871)
TinyML-Autopilot Processor Comparison Analysis
This notebook compares the performance of the PSG and TPUSG processors across different models and parameter configurations.
Core Analysis Framework:
- 2 Processors: PSG vs TPUSG
- 5 Models: qwen32b, qwen14b, phi4, gemma3:27b, codestral
- 2 Conditions: With Parameters vs Without Parameters
- Goal: Determine which processor performs better under different configurations
Define Omissive Errors
Combine CSV Files
Found 62 CSV files to combine
Combined data saved to: /home/han/Projects/reference-benchmark-tinyml_llm/combined_tinyml_benchmark_data.csv
Total rows: 1774
Total unique batch_ids: 61
'/home/han/Projects/reference-benchmark-tinyml_llm/combined_tinyml_benchmark_data.csv'
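The combine step can be sketched with the standard library alone. This is a minimal sketch rather than the notebook's own cell: the function name `combine_csv_files` and its arguments are hypothetical, and it assumes every per-batch CSV has a header row.

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path, pattern="*.csv"):
    """Concatenate every CSV matching `pattern` into one file.

    Returns (number_of_files, number_of_rows). The output file itself is
    skipped if it lives in `input_dir`, so re-running does not double-count.
    """
    input_dir = Path(input_dir)
    output_path = Path(output_path)
    files = sorted(
        p for p in input_dir.glob(pattern)
        if p.resolve() != output_path.resolve()
    )

    rows, fieldnames = [], []
    for path in files:
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            # Keep the union of all columns, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    with output_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(files), len(rows)
```

Excluding the output path from the glob matters when the combined file is written next to its inputs, as the saved path above suggests.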
Assigning 20 Categories According to Processors, Models, and Parameters
Removing 83 rows due to skipped_error set.
Dataset loaded successfully!
Shape: (1691, 20)
Final shape: (1571, 16)
Processor distribution: {'psg': 827, 'tpusg': 744}
Parameter distribution: {True: 853, False: 718}
Category distribution: {'psg-qwen32b-True': 136, 'tpusg-phi4-True': 120, 'tpusg-qwen32b-True': 118, 'psg-phi4-True': 90, 'tpusg-qwen32b-False': 90, 'psg-qwen14b-False': 90, 'psg-codestral-True': 90, 'psg-phi4-False': 90, 'psg-qwen14b-True': 87, 'tpusg-codestral-True': 87, 'tpusg-phi4-False': 82, 'tpusg-qwen14b-False': 75, 'psg-codestral-False': 72, 'psg-gemma3:27b-False': 60, 'psg-qwen32b-False': 60, 'tpusg-codestral-False': 56, 'psg-gemma3:27b-True': 52, 'tpusg-gemma3:27b-True': 43, 'tpusg-gemma3:27b-False': 43, 'tpusg-qwen14b-True': 30}
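A sketch of how the 20 category labels could be derived, assuming each label is just `processor-model-parameters` joined with hyphens, which matches the labels printed above. The helper names `make_category` and `category_distribution` are hypothetical.

```python
from collections import Counter
from itertools import product

PROCESSORS = ["psg", "tpusg"]
MODELS = ["qwen32b", "qwen14b", "phi4", "gemma3:27b", "codestral"]


def make_category(processor, model_config, parameters):
    """Build a label such as 'tpusg-codestral-True'."""
    return f"{processor}-{model_config}-{bool(parameters)}"


# 2 processors x 5 models x 2 parameter conditions = 20 categories.
ALL_CATEGORIES = [
    make_category(p, m, flag)
    for p, m, flag in product(PROCESSORS, MODELS, [True, False])
]


def category_distribution(runs):
    """Count runs per category; each run is a dict with the three keys below."""
    return Counter(
        make_category(r["processor"], r["model_config"], r["parameters"])
        for r in runs
    )
```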
| | num_run | name | batch_id | status | latency | total_tokens | prompt_tokens | completion_tokens | parameters | generation_count | tags | timestamp | test_date | model_config | processor | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | e2fa_tpu_sketch_generator | codestral_34a5_tpusg | failure | 110.76 | 13175 | 10240 | 2935 | True | 5 | ['benchmark', 'codestral:latest', 'tpu_sketch_... | 1755697779 | 08.24 | codestral | tpusg | tpusg-codestral-True |
| 1 | 2 | 4a9e_tpu_sketch_generator | codestral_34a5_tpusg | failure | 100.66 | 13827 | 10240 | 3587 | True | 5 | ['benchmark', 'codestral:latest', 'tpu_sketch_... | 1755697904 | 08.24 | codestral | tpusg | tpusg-codestral-True |
| 2 | 3 | d3b0_tpu_sketch_generator | codestral_34a5_tpusg | failure | 48.59 | 11737 | 10240 | 1497 | True | 5 | ['benchmark', 'codestral:latest', 'tpu_sketch_... | 1755698034 | 08.24 | codestral | tpusg | tpusg-codestral-True |
| 3 | 4 | 8c83_tpu_sketch_generator | codestral_34a5_tpusg | failure | 99.30 | 13766 | 10240 | 3526 | True | 5 | ['benchmark', 'codestral:latest', 'tpu_sketch_... | 1755698097 | 08.24 | codestral | tpusg | tpusg-codestral-True |
| 4 | 5 | 05c6_tpu_sketch_generator | codestral_34a5_tpusg | failure | 119.65 | 14585 | 10240 | 4345 | True | 5 | ['benchmark', 'codestral:latest', 'tpu_sketch_... | 1755698227 | 08.24 | codestral | tpusg | tpusg-codestral-True |
Grouping and Aggregating Test Batches
Total runs: 1571 (PSG/TPUSG: 827/744)
Models: codestral, gemma3:27b, phi4, qwen14b, qwen32b
Parameter conditions: P (853) vs NP (718)
📈 Complete Processor Comparison Matrix:
------------------------------------------------------------
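| model_config | parameters | psg success rate (%) | tpusg success rate (%) |
|---|---|---|---|
| codestral | False | 50.0 | 28.6 |
| codestral | True | 10.0 | 11.5 |
| gemma3:27b | False | 53.3 | 2.3 |
| gemma3:27b | True | 71.2 | 0.0 |
| phi4 | False | 77.8 | 95.1 |
| phi4 | True | 100.0 | 100.0 |
| qwen14b | False | 52.2 | 92.0 |
| qwen14b | True | 4.6 | 100.0 |
| qwen32b | False | 50.0 | 96.7 |
| qwen32b | True | 39.7 | 33.9 |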
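The matrix above is a per-group success percentage. A minimal sketch of that aggregation follows, assuming each run is a dict with `processor`, `model_config`, `parameters`, and `status` keys and that a successful run carries `status == "success"` (only `failure` is visible in the preview rows, so the success value is an assumption). The function name is hypothetical.

```python
from collections import defaultdict


def success_rate_matrix(runs):
    """Return {(model_config, parameters, processor): success rate in %}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for run in runs:
        key = (run["model_config"], run["parameters"], run["processor"])
        totals[key] += 1
        # Assumed label for a successful run; adjust if the data differs.
        successes[key] += run["status"] == "success"
    return {
        key: round(100.0 * successes[key] / totals[key], 1)
        for key in totals
    }
```

For example, 36 successes out of 72 runs gives 50.0 and 16 out of 56 gives 28.6, the two codestral rows without parameters.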
🎯 PROCESSOR ADVANTAGE ANALYSIS:
------------------------------------------------------------
codestral (With params): PSG 10.0% vs TPUSG 11.5% → TPUSG (+1.5%)
codestral (Without params): PSG 50.0% vs TPUSG 28.6% → PSG (+21.4%)
gemma3:27b (With params): PSG 71.2% vs TPUSG 0.0% → PSG (+71.2%)
gemma3:27b (Without params): PSG 53.3% vs TPUSG 2.3% → PSG (+51.0%)
phi4 (With params): PSG 100.0% vs TPUSG 100.0% → TIE (+0.0%)
phi4 (Without params): PSG 77.8% vs TPUSG 95.1% → TPUSG (+17.3%)
qwen14b (With params): PSG 4.6% vs TPUSG 100.0% → TPUSG (+95.4%)
qwen14b (Without params): PSG 52.2% vs TPUSG 92.0% → TPUSG (+39.8%)
qwen32b (With params): PSG 39.7% vs TPUSG 33.9% → PSG (+5.8%)
qwen32b (Without params): PSG 50.0% vs TPUSG 96.7% → TPUSG (+46.7%)
📊 SUMMARY:
PSG wins: 4/10 configurations
TPUSG wins: 5/10 configurations
Ties: 1/10 configurations
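The win/tie tally can be reproduced from the comparison matrix. The sketch below hard-codes the ten rate pairs from the matrix and reports each margin as an absolute value next to the winner's name, so the sign never has to be printed; `advantage` and `tally` are hypothetical names.

```python
from collections import Counter

# (model, with_parameters) -> (PSG %, TPUSG %), copied from the matrix.
RATES = {
    ("codestral", True): (10.0, 11.5),
    ("codestral", False): (50.0, 28.6),
    ("gemma3:27b", True): (71.2, 0.0),
    ("gemma3:27b", False): (53.3, 2.3),
    ("phi4", True): (100.0, 100.0),
    ("phi4", False): (77.8, 95.1),
    ("qwen14b", True): (4.6, 100.0),
    ("qwen14b", False): (52.2, 92.0),
    ("qwen32b", True): (39.7, 33.9),
    ("qwen32b", False): (50.0, 96.7),
}


def advantage(psg, tpusg, tol=1e-9):
    """Return (winner, margin in percentage points); margin is never negative."""
    diff = tpusg - psg
    if abs(diff) <= tol:
        return "TIE", 0.0
    return ("TPUSG" if diff > 0 else "PSG"), abs(diff)


def tally(rates):
    """Count wins per processor (and ties) across all configurations."""
    return Counter(advantage(psg, tpusg)[0] for psg, tpusg in rates.values())


if __name__ == "__main__":
    for (model, with_params), (psg, tpusg) in RATES.items():
        winner, margin = advantage(psg, tpusg)
        label = "With params" if with_params else "Without params"
        print(f"{model} ({label}): PSG {psg:.1f}% vs TPUSG {tpusg:.1f}% "
              f"→ {winner} (+{margin:.1f}%)")
```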
📋 COMPLETE COMPARISON TABLE (Traditional + Weighted Success Rates):
----------------------------------------------------------------------------------------------------
| | processor | model_config | parameters | total_runs | num_batches | successes | success_rate | efficiency_weighted_rate | exponential_weighted_rate | linear_weighted_rate | robust_weighted_rate | avg_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | psg | codestral | False | 72 | 2.4 | 36 | 50.00 | 25.16 | 28.16 | 36.67 | 42.92 | 10099.29 |
| 1 | psg | codestral | True | 90 | 3.0 | 9 | 10.00 | 2.67 | 2.52 | 4.22 | 5.33 | 13197.82 |
| 2 | psg | gemma3:27b | False | 60 | 2.0 | 32 | 53.33 | 30.42 | 32.13 | 39.67 | 44.00 | 9948.50 |
| 3 | psg | gemma3:27b | True | 52 | 1.7 | 37 | 71.15 | 15.58 | 11.99 | 19.62 | 29.42 | 12714.37 |
| 4 | psg | phi4 | False | 90 | 3.0 | 70 | 77.78 | 37.85 | 39.12 | 49.56 | 56.44 | 9168.93 |
| 5 | psg | phi4 | True | 90 | 3.0 | 90 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 2242.97 |
| 6 | psg | qwen14b | False | 90 | 3.0 | 47 | 52.22 | 28.26 | 30.20 | 37.78 | 42.33 | 9432.88 |
| 7 | psg | qwen14b | True | 87 | 2.9 | 4 | 4.60 | 1.38 | 1.37 | 2.07 | 2.87 | 11955.85 |
| 8 | psg | qwen32b | False | 60 | 2.0 | 30 | 50.00 | 35.53 | 36.00 | 40.00 | 42.33 | 9810.17 |
| 9 | psg | qwen32b | True | 136 | 4.5 | 54 | 39.71 | 9.62 | 8.19 | 12.94 | 18.09 | 13483.35 |
| 10 | tpusg | codestral | False | 56 | 1.9 | 16 | 28.57 | 12.08 | 12.71 | 17.14 | 19.82 | 12682.05 |
| 11 | tpusg | codestral | True | 87 | 2.9 | 10 | 11.49 | 3.35 | 3.40 | 5.75 | 6.90 | 13309.66 |
| 12 | tpusg | gemma3:27b | False | 43 | 1.4 | 1 | 2.33 | 2.33 | 2.33 | 2.33 | 2.33 | 15158.37 |
| 13 | tpusg | gemma3:27b | True | 43 | 1.4 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 15248.60 |
| 14 | tpusg | phi4 | False | 82 | 2.7 | 78 | 95.12 | 76.61 | 79.49 | 85.85 | 91.95 | 4790.41 |
| 15 | tpusg | phi4 | True | 120 | 4.0 | 120 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 2749.41 |
| 16 | tpusg | qwen14b | False | 75 | 2.5 | 69 | 92.00 | 70.16 | 72.60 | 80.00 | 85.20 | 5259.63 |
| 17 | tpusg | qwen14b | True | 30 | 1.0 | 30 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 2609.00 |
| 18 | tpusg | qwen32b | False | 90 | 3.0 | 87 | 96.67 | 72.26 | 75.12 | 83.33 | 90.11 | 5151.87 |
| 19 | tpusg | qwen32b | True | 118 | 3.9 | 40 | 33.90 | 9.11 | 8.66 | 14.41 | 17.80 | 14093.42 |
Comparison, Visualization, and Insights
Comprehensive visual analysis and strategic recommendations for processor selection.
📊 CREATING PROCESSOR COMPARISON VISUALIZATIONS
============================================================
✅ Processor comparison visualizations created successfully!
📊 Analysis shows clear performance differences between PSG and TPUSG processors
Five Different Success Rate Metrics
🎓 PROCESSOR COMPARISON INSIGHTS
============================================================
📊 OVERALL PERFORMANCE:
Traditional Success Rate:
PSG Average: 50.9%
TPUSG Average: 56.0%
Overall Winner: TPUSG (+5.1%)
Efficiency Weighted:
PSG Average: 28.6%
TPUSG Average: 44.6%
Winner: TPUSG (+15.9%)
Exponential Weighted:
PSG Average: 29.0%
TPUSG Average: 45.4%
Winner: TPUSG (+16.5%)
Linear Weighted:
PSG Average: 34.3%
TPUSG Average: 48.9%
Winner: TPUSG (+14.6%)
Robust Weighted:
PSG Average: 38.4%
TPUSG Average: 51.4%
Winner: TPUSG (+13.0%)
⚙️ PARAMETER EFFECTS:
Traditional Success Rate:
PSG: With params 45.1% vs Without params 56.7% (Effect: -11.6%)
TPUSG: With params 49.1% vs Without params 62.9% (Effect: -13.9%)
Efficiency Weighted:
PSG: With params 25.9% vs Without params 31.4% (Effect: -5.6%)
TPUSG: With params 42.5% vs Without params 46.7% (Effect: -4.2%)
Exponential Weighted:
PSG: With params 24.8% vs Without params 33.1% (Effect: -8.3%)
TPUSG: With params 42.4% vs Without params 48.5% (Effect: -6.0%)
Linear Weighted:
PSG: With params 27.8% vs Without params 40.7% (Effect: -13.0%)
TPUSG: With params 44.0% vs Without params 53.7% (Effect: -9.7%)
Robust Weighted:
PSG: With params 31.1% vs Without params 45.6% (Effect: -14.5%)
TPUSG: With params 44.9% vs Without params 57.9% (Effect: -12.9%)
⚙️ PARAMETER USAGE STRATEGY:
PSG: Parameters hurt performance (-11.6%)
TPUSG: Parameters hurt performance (-13.9%)
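The "Traditional Success Rate" averages appear to be unweighted means over each processor's ten per-configuration rates. That would explain why they differ from the pooled run-level rates the notebook reports elsewhere (49.5% for PSG, 60.6% for TPUSG), since configurations with more runs do not count more. The sketch below recomputes both views from the `successes` and `total_runs` columns of the comparison table; the function names are hypothetical.

```python
from statistics import mean

# (processor, with_parameters, successes, total_runs) per configuration,
# copied from the comparison table.
CONFIGS = [
    ("psg", False, 36, 72), ("psg", True, 9, 90),
    ("psg", False, 32, 60), ("psg", True, 37, 52),
    ("psg", False, 70, 90), ("psg", True, 90, 90),
    ("psg", False, 47, 90), ("psg", True, 4, 87),
    ("psg", False, 30, 60), ("psg", True, 54, 136),
    ("tpusg", False, 16, 56), ("tpusg", True, 10, 87),
    ("tpusg", False, 1, 43), ("tpusg", True, 0, 43),
    ("tpusg", False, 78, 82), ("tpusg", True, 120, 120),
    ("tpusg", False, 69, 75), ("tpusg", True, 30, 30),
    ("tpusg", False, 87, 90), ("tpusg", True, 40, 118),
]


def macro_rate(processor, with_parameters=None):
    """Unweighted mean of per-configuration success rates, in %."""
    return mean(
        100.0 * s / n
        for proc, flag, s, n in CONFIGS
        if proc == processor and with_parameters in (None, flag)
    )


def micro_rate(processor):
    """Pooled success rate over all runs of a processor, in %."""
    rows = [(s, n) for proc, _, s, n in CONFIGS if proc == processor]
    return 100.0 * sum(s for s, _ in rows) / sum(n for _, n in rows)


def parameter_effect(processor):
    """Macro rate with parameters minus macro rate without, in points."""
    return macro_rate(processor, True) - macro_rate(processor, False)
```

Rounded to one decimal, `macro_rate("psg")` gives 50.9, `micro_rate("psg")` gives 49.5, and `parameter_effect("psg")` gives -11.6.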
📊 TRADITIONAL vs WEIGHTED SUCCESS RATES COMPARISON
======================================================================
✅ Weighted success rate analysis complete!
📈 The efficiency-weighted metric gives TPUSG a 15.9-point advantage, versus 5.1 points for the traditional success rate.
Success Rates per Model for TPUSG and PSG, With and Without Parameters
📊 DETAILED PROCESSOR COMPARISON BY MODEL & PARAMETERS
======================================================================
📈 SUMMARY COMPARISON TABLE:
--------------------------------------------------------------------------------
Model PSG+Params PSG TPUSG+Params TPUSG Best_Config
codestral 10.0% 50.0% 11.5% 28.6% PSG
gemma3:27b 71.2% 53.3% 0.0% 2.3% PSG+Params
phi4 100.0% 77.8% 100.0% 95.1% PSG+Params
qwen14b 4.6% 52.2% 100.0% 92.0% TPUSG+Params
qwen32b 39.7% 50.0% 33.9% 96.7% TPUSG
PSG vs TPUSG Across Multiple Dimensions
📊 RUN-LEVEL STANDARD DEVIATION & STATISTICAL ANALYSIS
================================================================================
Removing 83 rows due to skipped_error set.
Dataset loaded successfully! Shape: (1691, 20)
Cleaned dataset shape: (1571, 16)
Processor distribution: {'psg': 827, 'tpusg': 744}
Date range: 2025-07-29 11:00:38 to 2025-08-20 14:52:35
📊 RUN-LEVEL DATASET STATISTICS:
--------------------------------------------------
PSG Processor:
Total runs: 827
Unique batches: 27
Avg runs per batch: 30.6
Success rate: 49.5%
Successful runs: 409
Failed runs: 418
Avg generation count: 3.82
Generation count range: [1, 5]
Avg generations for success: 2.62
TPUSG Processor:
Total runs: 744
Unique batches: 29
Avg runs per batch: 25.7
Success rate: 60.6%
Successful runs: 451
Failed runs: 293
Avg generation count: 2.99
Generation count range: [1, 5]
Avg generations for success: 1.70
📈 2. RUN-LEVEL METRICS CALCULATION
============================================================
📊 COMPREHENSIVE RUN-LEVEL STATISTICS:
======================================================================
Success Rate (Run Level):
------------------------------------------------------------
PSG: 49.5% ± 50.0% (n=827 runs)
95% CI: [46.0%, 52.9%]
CV: 1.012
TPUSG: 60.6% ± 48.9% (n=744 runs)
95% CI: [57.1%, 64.1%]
CV: 0.807
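The run-level figures above can be reproduced by treating each run as a 0/1 outcome. The sketch assumes a sample standard deviation (n − 1 denominator), a normal-approximation interval with z = 1.96, and CV defined as standard deviation divided by mean; whether the notebook used exactly these choices is not shown, but they reproduce the printed values. `proportion_summary` is a hypothetical name.

```python
import math
from statistics import mean, stdev


def proportion_summary(successes, n, z=1.96):
    """Summarize a binary outcome: rate, SD, 95% CI, coefficient of variation."""
    outcomes = [1] * successes + [0] * (n - successes)
    rate = mean(outcomes)
    sd = stdev(outcomes)                 # sample SD of the 0/1 outcomes
    half_width = z * sd / math.sqrt(n)   # normal-approximation margin
    return {
        "rate": rate,
        "sd": sd,
        "ci": (rate - half_width, rate + half_width),
        "cv": sd / rate,
    }


psg = proportion_summary(409, 827)
tpusg = proportion_summary(451, 744)
```

Because the outcome is binary, the standard deviation is close to 50% whenever the rate is near one half, so the "± 50.0%" reflects the 0/1 scale more than run-to-run instability.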
Generation Count (Attempts per Run):
------------------------------------------------------------
PSG: 3.821 ± 1.605 (n=827 runs)
95% CI: [3.712, 3.930]
Range: [1.000, 5.000]
Median (IQR): 5.000 [2.000, 5.000]
CV: 0.420
TPUSG: 2.991 ± 1.836 (n=744 runs)
95% CI: [2.859, 3.123]
Range: [1.000, 5.000]
Median (IQR): 3.000 [1.000, 5.000]
CV: 0.614
Total Tokens per Run:
------------------------------------------------------------
PSG: 10292.073 ± 4542.158 (n=827 runs)
95% CI: [9982.498, 10601.647]
Range: [2220.000, 17991.000]
Median (IQR): 12283.000 [5859.000, 13798.000]
CV: 0.441
TPUSG: 8733.617 ± 5582.263 (n=744 runs)
95% CI: [8332.492, 9134.742]
Range: [2499.000, 16582.000]
Median (IQR): 8753.500 [2750.000, 15018.500]
CV: 0.639
Latency per Run (seconds):
------------------------------------------------------------
PSG: 73.421 ± 45.820 (n=827 runs)
95% CI: [70.298, 76.544]
Range: [10.220, 263.980]
Median (IQR): 67.560 [38.270, 103.505]
CV: 0.624
TPUSG: 77.298 ± 61.343 (n=744 runs)
95% CI: [72.890, 81.706]
Range: [12.570, 244.300]
Median (IQR): 56.790 [19.735, 140.567]
CV: 0.794
🔬 3. STATISTICAL SIGNIFICANCE TESTING (RUN LEVEL)
============================================================
🔬 STATISTICAL TEST RESULTS:
======================================================================
Success Rate (Run Level):
------------------------------------------------------------
Sample sizes: PSG=827, TPUSG=744
PSG: 49.5% ± 50.0% (CV: 1.012)
TPUSG: 60.6% ± 48.9% (CV: 0.807)
Difference: +11.2% (TPUSG - PSG)
Chi-square: χ²=19.249, p<0.0001, Significant: Yes
Effect size (Cohen's h): 0.225 (small)
Interpretation: TPUSG outperforms PSG
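The success-rate test can be checked by hand from the 2×2 table of successes and failures (PSG 409/418, TPUSG 451/293). The sketch below uses a chi-square test with Yates' continuity correction, which reproduces χ² = 19.249, and Cohen's h from the arcsine-transformed proportions. It uses only the standard library; the function names are hypothetical, and the p-value formula is specific to one degree of freedom.

```python
import math


def yates_chi_square_2x2(a, b, c, d):
    """Chi-square test with Yates' correction for the table [[a, b], [c, d]].

    Returns (chi2, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def cohens_h(p1, p2):
    """Effect size for two proportions (p2 relative to p1)."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


chi2, p_value = yates_chi_square_2x2(409, 418, 451, 293)
h = cohens_h(409 / 827, 451 / 744)
```

By Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), h ≈ 0.225 sits at the low end of the small range.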
Generation Count (Attempts per Run):
------------------------------------------------------------
Sample sizes: PSG=827, TPUSG=744
PSG: 3.821 ± 1.605 (CV: 0.420)
TPUSG: 2.991 ± 1.836 (CV: 0.614)
Difference: -0.830 (TPUSG - PSG)
Mann-Whitney U: U=381602, p<0.0001, Significant: Yes
T-test: t=9.497, p<0.0001, Significant: Yes
Effect size (Cohen's d): -0.483 (small)
Interpretation: TPUSG needs fewer generation attempts per run than PSG (lower is better)
Total Tokens per Run:
------------------------------------------------------------
Sample sizes: PSG=827, TPUSG=744
PSG: 10292.073 ± 4542.158 (CV: 0.441)
TPUSG: 8733.617 ± 5582.263 (CV: 0.639)
Difference: -1558.456 (TPUSG - PSG)
Mann-Whitney U: U=309222, p=0.8605, Significant: No
T-test: t=6.028, p<0.0001, Significant: Yes
Effect size (Cohen's d): -0.308 (small)
Interpretation: TPUSG uses fewer tokens on average, but the two tests disagree, so the difference is inconclusive
Latency per Run (seconds):
------------------------------------------------------------
Sample sizes: PSG=827, TPUSG=744
PSG: 73.421 ± 45.820 (CV: 0.624)
TPUSG: 77.298 ± 61.343 (CV: 0.794)
Difference: +3.877 (TPUSG - PSG)
Mann-Whitney U: U=310380, p=0.7606, Significant: No
T-test: t=-1.407, p=0.1597, Significant: No
Effect size (Cohen's d): 0.072 (negligible)
Interpretation: No significant latency difference between TPUSG and PSG
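The Cohen's d values for the three continuous metrics follow from the printed means, standard deviations, and sample sizes. This sketch assumes the pooled-standard-deviation form of d and labels magnitude with Cohen's conventional thresholds (0.2, 0.5, 0.8); the function names are hypothetical. The sign only says which group has the larger mean, so whether a negative d is good depends on the metric: fewer attempts, fewer tokens, and lower latency are all better.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for group b relative to group a, using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var)


def magnitude(effect):
    """Label an effect size by Cohen's conventional thresholds."""
    size = abs(effect)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Group a = PSG (n=827), group b = TPUSG (n=744), values from the output above.
d_generations = cohens_d(3.821, 1.605, 827, 2.991, 1.836, 744)
d_tokens = cohens_d(10292.073, 4542.158, 827, 8733.617, 5582.263, 744)
d_latency = cohens_d(73.421, 45.820, 827, 77.298, 61.343, 744)
```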
📈 4. RUN-LEVEL VARIANCE ANALYSIS VISUALIZATION
============================================================
STATISTICAL SUMMARY (RUN LEVEL):
--------------------------------
Significant Differences (p < 0.05):
✓ Success Binary: Chi-square p<0.0001 (Cohen's h: 0.225, small)
✓ Generation Count: Mann-Whitney p<0.0001 (Cohen's d: -0.483, small)
✗ Total Tokens: Mann-Whitney p=0.8605 (Cohen's d: -0.308, small)
✗ Latency: Mann-Whitney p=0.7606 (Cohen's d: 0.072, negligible)
Sample Sizes:
Total runs: 1571
PSG runs: 827 (52.6%)
TPUSG runs: 744 (47.4%)
Unique batches: 56
📋 5. COMPREHENSIVE RUN-LEVEL ANALYSIS SUMMARY
======================================================================
🎯 RUN-LEVEL PERFORMANCE SUMMARY:
--------------------------------------------------
Dataset Overview:
Total runs analyzed: 1571
Unique batches: 56
Average runs per batch: 28.1
Date range: 2025-07-29 to 2025-08-20
📊 KEY FINDINGS:
------------------------------
Success Rates (Run Level):
PSG: 49.5% (409/827 runs)
TPUSG: 60.6% (451/744 runs)
Difference: +11.2% (TPUSG - PSG)
Generation Efficiency (Attempts per Run):
PSG: 3.82 ± 1.60
TPUSG: 2.99 ± 1.84
More Efficient: TPUSG
📊 CONSISTENCY COMPARISON (CV):
----------------------------------------
Success Rate: TPUSG more consistent (PSG: 1.012, TPUSG: 0.807)
Generation Count: PSG more consistent (PSG: 0.420, TPUSG: 0.614)
Total Tokens per Run: PSG more consistent (PSG: 0.441, TPUSG: 0.639)
Latency per Run: PSG more consistent (PSG: 0.624, TPUSG: 0.794)
🔬 STATISTICAL SIGNIFICANCE SUMMARY:
--------------------------------------------------
Metrics with significant differences: 2/4
Statistical power: High (large run-level sample sizes)
✅ RUN-LEVEL ANALYSIS COMPLETE!
📊 Analysis based on 1571 individual runs as primary data points
🎯 Each run represents one complete test execution with success/failure outcome
📈 Generation count reflects efficiency of achieving success within each run
🔬 Statistical tests performed at the most granular level for maximum power
Temporal Analysis
📈 BATCH SUCCESS RATE TIMELINE ANALYSIS
======================================================================
🔄 Processing batch timeline data...
📊 Analyzed 56 unique batches
📅 Date range: 2025-07-29 11:00 to 2025-08-20 13:49
📈 CREATING BATCH TIMELINE VISUALIZATION...
📊 BATCH TIMELINE SUMMARY:
--------------------------------------------------
Total unique batches: 56
Date range: 12 unique days
Processor breakdown:
PSG batches: 27 (avg success: 47.1%)
TPUSG batches: 29 (avg success: 60.2%)
🏆 Best performing batch:
qwen32b_4e11_tpusg: 100.0% (tpusg, qwen32b)
📉 Worst performing batch:
qwen14b_33b8_psg: 0.0% (psg, qwen14b)
✅ Batch timeline analysis complete!
📈 Timeline shows 56 batches across 21 processor-date combinations
📊 DETAILED BATCH TIMELINE: WITH/WITHOUT PARAMETERS ANALYSIS
======================================================================
🔄 Creating parameter-separated visualizations...
📊 PARAMETER-SPECIFIC BATCH SUMMARY:
------------------------------------------------------------
📈 WITH Parameters:
Total batches: 28
PSG: 15 batches (avg: 39.9%)
TPUSG: 13 batches (avg: 43.8%)
Best with params: qwen14b_3193_tpusg (100.0%)
Worst with params: qwen14b_33b8_psg (0.0%)
📉 WITHOUT Parameters:
Total batches: 28
PSG: 12 batches (avg: 56.2%)
TPUSG: 16 batches (avg: 73.5%)
Best without params: qwen32b_4e11_tpusg (100.0%)
Worst without params: gemma3:27b_04fd_tpusg (0.0%)
⚙️ PARAMETER EFFECT ANALYSIS:
--------------------------------------------------
qwen32b:
PSG: 41.2% (with) vs 50.0% (without) → -8.8% effect
TPUSG: 33.7% (with) vs 96.5% (without) → -62.8% effect
codestral:
PSG: 10.0% (with) vs 54.8% (without) → -44.8% effect
TPUSG: 11.7% (with) vs 28.2% (without) → -16.5% effect
phi4:
PSG: 100.0% (with) vs 73.3% (without) → +26.7% effect
TPUSG: 100.0% (with) vs 96.3% (without) → +3.7% effect
qwen14b:
PSG: 4.7% (with) vs 52.2% (without) → -47.5% effect
TPUSG: 100.0% (with) vs 89.0% (without) → +11.0% effect
gemma3:27b:
PSG: 73.8% (with) vs 53.3% (without) → +20.5% effect
TPUSG: 0.0% (with) vs 3.3% (without) → -3.3% effect
✅ Parameter-separated batch timeline analysis complete!
📊 Analysis reveals parameter effects across 56 batches in timeline order
Temporal Trends
📈 20-CONFIGURATION TIMELINE WITH ORGANIZED LINES
================================================================================
🚀 Starting 20-configuration timeline analysis...
🔄 Processing 20-configuration batch timeline data...
📊 Analyzed 56 unique batches
📅 Date range: 2025-07-29 11:00 to 2025-08-20 13:49
🔧 Total configurations found: 20
📋 ORGANIZING 20 CONFIGURATIONS:
============================================================
🔵 PSG Group: 10 configurations
🔴 TPUSG Group: 10 configurations
📈 CREATING 20-LINE ORGANIZED TIMELINE...
📊 Creating timeline with 10 PSG + 10 TPUSG lines
🎨 Plotting PSG configuration lines...
🎨 Plotting TPUSG configuration lines...
📊 20-CONFIGURATION TIMELINE SUMMARY
==================================================
🔵 PSG CONFIGURATIONS:
-----------------------------------
phi4 WithParams 100.0% (2b)
gemma3:27b WithParams 73.8% (2b)
phi4 NoParams 73.3% (2b)
codestral NoParams 54.8% (3b)
gemma3:27b NoParams 53.3% (2b)
qwen14b NoParams 52.2% (3b)
qwen32b NoParams 50.0% (2b)
qwen32b WithParams 41.2% (5b)
codestral WithParams 10.0% (3b)
qwen14b WithParams 4.7% (3b)
🔴 TPUSG CONFIGURATIONS:
-----------------------------------
phi4 WithParams 100.0% (3b)
qwen14b WithParams 100.0% (1b)
qwen32b NoParams 96.5% (4b)
phi4 NoParams 96.3% (2b)
qwen14b NoParams 89.0% (6b)
qwen32b WithParams 33.7% (4b)
codestral NoParams 28.2% (2b)
codestral WithParams 11.7% (3b)
gemma3:27b NoParams 3.3% (2b)
gemma3:27b WithParams 0.0% (2b)
🏆 OVERALL AVERAGES:
-------------------------
PSG Average: 51.3%
TPUSG Average: 55.9%
Winner: TPUSG (+4.5%)
✅ 20-LINE TIMELINE COMPLETE!
==================================================
📈 Successfully plotted 20 configuration lines
🔵 PSG lines: 10 (blue colors)
🔴 TPUSG lines: 10 (red colors)
━ Solid lines: With parameters
╌ Dashed lines: Without parameters
📊 Each line shows temporal evolution of one specific configuration
📊 Current Data Structure Analysis
==================================================
✅ batch_timeline_data exists with 56 records
Date range: 2025-07-29 11:00:38 to 2025-08-20 13:49:39
Unique configurations: 20
📈 LOCAL SLOPE AGGREGATION ANALYSIS
================================================================================
🔄 Calculating slopes for each configuration...
📊 Calculated 36 slopes from 20 configurations
📅 Slope date range: 2025-07-29 to 2025-08-20
⏱️ Time spans: 0.01 to 20.17 days
📈 Slope range: -150.83 to 2568.62 %/day
📋 SLOPES SUMMARY BY PROCESSOR:
--------------------------------------------------
🔵 PSG: 17 slopes, mean: 0.396 %/day
🔴 TPUSG: 19 slopes, mean: 137.491 %/day
📊 SAMPLE SLOPES:
----------------------------------------
config_id start_date end_date time_span_days slope
tpusg_False_qwen32b 2025-07-29 2025-07-29 0.047130 223.426326
tpusg_False_qwen32b 2025-07-29 2025-08-03 5.467604 -0.609042
tpusg_False_qwen32b 2025-08-03 2025-08-04 0.594931 5.597292
tpusg_False_codestral 2025-07-29 2025-08-12 14.257488 0.718920
tpusg_True_qwen32b 2025-07-30 2025-08-03 4.545972 -8.647215
tpusg_True_qwen32b 2025-08-03 2025-08-04 0.263056 35.391763
tpusg_True_qwen32b 2025-08-04 2025-08-04 0.399456 -14.669950
psg_True_qwen32b 2025-07-30 2025-08-03 4.529919 -0.735112
psg_True_qwen32b 2025-08-03 2025-08-04 0.262037 62.014134
psg_True_qwen32b 2025-08-04 2025-08-04 0.401898 -57.029374
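Each local slope is the change in batch success rate divided by the elapsed time between two consecutive batches of the same configuration. A minimal sketch, assuming each configuration is a list of `(timestamp, success_rate_percent)` points; `local_slopes` is a hypothetical name.

```python
from datetime import datetime


def local_slopes(points):
    """Slopes in %/day between consecutive (timestamp, rate) points."""
    ordered = sorted(points)
    slopes = []
    for (t0, r0), (t1, r1) in zip(ordered, ordered[1:]):
        span_days = (t1 - t0).total_seconds() / 86400
        if span_days <= 0:
            continue  # identical timestamps give no defined slope
        slopes.append({
            "start": t0,
            "end": t1,
            "time_span_days": span_days,
            "slope": (r1 - r0) / span_days,
        })
    return slopes


# Hypothetical data: a 10-point change over 2 days vs. over 1.2 hours.
slow = local_slopes([(datetime(2025, 8, 1, 0, 0), 50.0),
                     (datetime(2025, 8, 3, 0, 0), 60.0)])
fast = local_slopes([(datetime(2025, 8, 1, 0, 0), 50.0),
                     (datetime(2025, 8, 1, 1, 12), 60.0)])
```

The same 10-point change yields 5 %/day over two days but 200 %/day over 0.05 days. Spans as short as 0.01 days therefore dominate an unweighted mean, which is worth keeping in mind when reading the 137.491 %/day TPUSG average.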
============================================================
🔄 Aggregating slopes by adjacent day pairs...
📅 Found 23 unique dates in slope data
📊 Created 22 aggregated day-pair slope measurements
🔵 PSG: 21 day-pairs with data
🔴 TPUSG: 22 day-pairs with data
📈 PSG slope range: -4.12 to 23.11 %/day
📈 TPUSG slope range: -1.97 to 7.04 %/day
📊 AGGREGATED SLOPES PREVIEW:
--------------------------------------------------
day_pair_start day_pair_end psg_mean_slope tpusg_mean_slope psg_slope_count tpusg_slope_count
2025-07-29 2025-07-30 NaN 0.054939 0 2
2025-07-30 2025-07-31 -0.496120 -1.330875 4 6
2025-07-31 2025-08-01 -0.496120 -1.330875 4 6
2025-08-01 2025-08-02 -0.496120 -1.330875 4 6
2025-08-02 2025-08-03 -0.496120 -1.330875 4 6
2025-08-03 2025-08-04 15.191192 7.043344 4 6
2025-08-04 2025-08-05 -4.116323 0.317752 4 4
2025-08-05 2025-08-06 -1.094686 0.211835 6 6
2025-08-06 2025-08-07 -1.094686 0.211835 6 6
2025-08-07 2025-08-08 -1.094686 0.211835 6 6
📈 CREATING AGGREGATED SLOPE VISUALIZATION...
============================================================
🎨 Creating aggregated slope trends visualization...
📊 AGGREGATED SLOPE ANALYSIS SUMMARY
============================================================
🎯 METHODOLOGY:
1. Calculate slopes between consecutive points for each config
2. For each day-pair, average slopes of all configs covering it
3. Plot aggregated trend (NOT success rate, but rate of change)
📈 SLOPE TREND COMPARISON:
-----------------------------------
PSG Average Slope: 2.36 ± 6.11 %/day
TPUSG Average Slope: -0.00 ± 1.71 %/day
Difference: -2.36 %/day
🏆 OVERALL TRENDS:
-------------------------
PSG Trend: Improving (+2.36 %/day)
TPUSG Trend: Flat (-0.00 %/day)
📊 DATA COVERAGE:
--------------------
PSG day-pairs: 21/22
TPUSG day-pairs: 22/22
Date range: 2025-07-29 to 2025-08-20
Avg configs per day-pair:
PSG: 4.6 configurations
TPUSG: 4.8 configurations
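Step 2 of the methodology can be sketched as follows. The sketch assumes a slope segment "covers" a day-pair when it starts on or before the first day and ends on or after the second, so a multi-day segment contributes the same slope to every day-pair it spans (consistent with the repeated values in the preview). Segments that start and end on the same date never cover a day-pair under this rule; how the notebook treats them is not shown. `aggregate_by_day_pair` is a hypothetical name.

```python
from datetime import date, timedelta
from statistics import mean


def aggregate_by_day_pair(segments):
    """Average slopes per adjacent day-pair.

    `segments` is a list of (start_date, end_date, slope) tuples.
    Returns a list of (day, next_day, mean_slope_or_None, count).
    """
    first = min(start for start, _, _ in segments)
    last = max(end for _, end, _ in segments)
    result = []
    day = first
    while day < last:
        next_day = day + timedelta(days=1)
        covering = [
            slope for start, end, slope in segments
            if start <= day and end >= next_day
        ]
        result.append(
            (day, next_day, mean(covering) if covering else None, len(covering))
        )
        day = next_day
    return result


# Hypothetical segments: one 4-day span and one 1-day span inside it.
pairs = aggregate_by_day_pair([
    (date(2025, 7, 30), date(2025, 8, 3), -1.0),
    (date(2025, 8, 1), date(2025, 8, 2), 3.0),
])
```

In this example the day-pair 2025-08-01 to 2025-08-02 averages both segments (mean 1.0, count 2), while the other three day-pairs carry only the long segment's slope of -1.0.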
✅ AGGREGATED SLOPE ANALYSIS COMPLETE!
==================================================
📊 Successfully analyzed 22 day-pair measurements
🎯 Key insight: Y-axis shows rate of change (%/day), not success rate
📈 Positive slopes = performance improving over time
📉 Negative slopes = performance declining over time
➖ Zero slope = stable performance